The 03_scrna_seq.Rmd file contains a common pipeline for analysing single-cell RNA-seq data using the Seurat package.
This workshop leans heavily on the tutorials used in the Seurat package https://satijalab.org/seurat/articles/get_started.html. Also, I have adapted content from the single cell workshop given at the IGMM from Alan O’Calaghan and Catalina Vallejos.
Here we will go through some basic concepts in scRNAseq data processing, chiefly pre-processing, QC, normalisation, and selection of variable genes. We’ll also show how to perform dimensionality reduction, clustering, and differential expression analysis within Seurat. Next we will use tailored methods to perform cell type classification using the scPred method (see 04_srna_clasification.Rmd file)
To run a whole R chunk, click on the corresponding green arrow Run Current chunk. To run a specific line within each chunk, place the cursor in that line and then press Ctrl + Enter (Windows) or Command + Enter Mac. Or click on the Run button at the top right of the panel.
knitr::opts_chunk$set(
echo = TRUE,
fig.path = "out_scrna/",
fig.width = 12,
fig.height = 8,
cache.path = ".cache/"
)library(Seurat)## Attaching SeuratObject
library(dplyr) ##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(patchwork)
library(purrr)We will work with the 10X Genomics PBMC dataset, consisting of 2,700 single cells sequenced on the Illumina NextSeq 500. A peripheral blood mononuclear cell (PBMC) is any peripheral blood cell having a round nucleus. These cells consist of lymphocytes (T cells, B cells, NK cells) and monocytes, whereas erythrocytes and platelets have no nuclei.
Here we will ignore the whole low-level processing of the scRNA-seq data, and work directly with the UMI count matrix, which is the output from cellRanger pipeline from 10X.
The data can be found here, however, we have already downloaded the dataset and stored it in the data folder. (NOTE Set RStudio to the current working directory, see https://www.ucl.ac.uk/~uctqiax/PUBLG100/2015/faq/setwd.html).
We use the Read10X function to load the count matrix in R.
pbmc_counts <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/")Let’s have a look at its structure. Due to the large amount of non-expressed genes, we generally store scRNA-seq data in sparse matrices.
# Rows denote number of genes and columns number of cells
dim(pbmc_counts)## [1] 32738 2700
# Show 11 genes (from 50-60th row) and 11 cells (from 60-70th column)
pbmc_counts[50:60, 60:70]## 11 x 11 sparse Matrix of class "dgCMatrix"
## [[ suppressing 11 column names 'AACCGCCTAGCGTT-1', 'AACCGCCTCTACGA-1', 'AACCTACTGTGAGG-1' ... ]]
##
## TTLL10-AS1 . . . . . . . . . . .
## TTLL10 . . . . . . . . . . .
## TNFRSF18 . . . . . . . . . . .
## TNFRSF4 2 . . . . . . . . . 1
## SDF4 . . 1 1 . . . . . . .
## B3GALT6 . 2 . . . . . . 1 . .
## FAM132A . . . . . . . . . . .
## RP5-902P8.12 . . . . . . . . . . .
## UBE2J2 2 . . . . . . . . . .
## RP5-902P8.10 . . . . . . . . . . .
## SCNN1D . . . . . . . . . . .
We can observe that the scRNA-seq data are in general really sparse!
S
# We can select specific genes of interest by their name
pbmc_counts[c("CD3D", "TCL1A", "MS4A1"), 1:30]## 3 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
These are marker genes for peripheral blood cells, hence we observe that they have high expression, compared to the random choice of genes in the previous example.
Next we create a Seurat object, which will make it easier to perform QC and downstream analysis. When creating the object we will also perform some broad filtering to remove really low quality cells and genes. As you can see from 32,738 genes, we are now left with 13,714.
# We keep all genes expressed in >= 3 cells and also keep
# all cells with at least 200 detected genes.
pbmc <- CreateSeuratObject(counts = pbmc_counts, project = "pbmc3k",
min.cells = 3, min.features = 200)## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
The raw counts data are stored in the following slot.
pbmc@assays$RNA@counts[50:55, 60:70]## 6 x 11 sparse Matrix of class "dgCMatrix"
## [[ suppressing 11 column names 'AACCGCCTAGCGTT-1', 'AACCGCCTCTACGA-1', 'AACCTACTGTGAGG-1' ... ]]
##
## C1orf86 . . 1 . . . 1 . 2 1 .
## AL590822.2 . . . . . . . . . . .
## SKI 1 . . . . . . . . . .
## RER1 . 1 1 2 . . 1 . 3 1 1
## PEX10 . . . . . . . . . . .
## PLCH2 . . . . . . . . . . .
# Seurat keeps also object metadata for each cell, where we can use to
# store QC metrics and analysis output, such as cluster assignments.
# We can access these metadata as follows for the first 10 cells.
# nCount_RNA corresponds to total number of UMI counts
# nFeature_RNA total number of genes that are expressed in the specific cell.
pbmc@meta.data[1:10, ]## orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1 pbmc3k 2419 779
## AAACATTGAGCTAC-1 pbmc3k 4903 1352
## AAACATTGATCAGC-1 pbmc3k 3147 1129
## AAACCGTGCTTCCG-1 pbmc3k 2639 960
## AAACCGTGTATGCG-1 pbmc3k 980 521
## AAACGCACTGGTAC-1 pbmc3k 2163 781
## AAACGCTGACCAGT-1 pbmc3k 2175 782
## AAACGCTGGTTCTT-1 pbmc3k 2260 790
## AAACGCTGTAGCCA-1 pbmc3k 1275 532
## AAACGCTGTTTCTG-1 pbmc3k 1103 550
Seurat allows us to easily explore cell QC metrics and filter cells based on user defined criteria. As seen, we can attempt to identify barcodes (cells) containing only ambient RNA. We can also filter cells based on the number of features they express (in other words, their complexity), their library size, and their mitochondrial gene proportion.
Barcodes with very low complexity or library size relative to all others may represent empty droplets or cells with very low capture efficiency. Barcodes with particularly high complexity or library size may indicate doublets, but they may also be normal cells. Some packages advocate dropping cells that exceed an upper bound on complexity or library size; however, there exist more sensitive ways of identifying doublets (we will not cover those in this tutorial, but you can check the scrublet package).
Mitochondrial genes are often used as a proxy for cell quality. When cells are stressed and/or lysed before being encapsulated in droplets, cytoplasmic RNA is often lost. In these cases, mitochondrial genes are present at elevated levels. Thus, it is important to calculate the proportion of mitochondrial genes, to remove cells with especially high mitochondrial gene proportions, and to account for it in downstream analysis. Below we compute the mitochondrial percentage and store it as additional column.
# The [[ operator can add columns to object metadata.
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc@meta.data[1:5, ]## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
Next we can visualise the QC metrics and make data driven decision on the thresholds.
# Visualize QC metrics independently
VlnPlot(pbmc, features = c("nFeature_RNA", "percent.mt"), ncol = 2)# We can also use the FeatureScatter function to visualize feature-feature relationships.
plot1 <- FeatureScatter(pbmc, feature1 = "nFeature_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1Based on these plots, we can then filter low quality cells, essentially removing around 60 cells.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc## An object of class Seurat
## 13714 features across 2638 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
After removing low quality genes/cells, we are ready to perform normalisation, mostly in order to remove the effect of library size. Here we employ the simplest strategy and perform global-scaling normalization (method LogNormalize in Seurat) that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
The division by total expression is done to change all expression counts to a relative measure, since experience has suggested that technical factors (e.g. capture rate, efficiency of reverse transcription) are largely responsible for the variation in the number of molecules per cell. The log-transformation is a commonly used transformation that has many desirable properties, such as variance stabilization.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)pbmc@assays$RNA@data[50:60, 1:5]## 11 x 5 sparse Matrix of class "dgCMatrix"
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## C1orf86 . . .
## AL590822.2 . . .
## SKI . . .
## RER1 . 1.111715 1.429744
## PEX10 . . .
## PLCH2 . . .
## PANK4 . . 1.429744
## RP3-395M20.12 . 1.111715 .
## TNFRSF14 . . .
## RP3-395M20.9 . . .
## FAM213B . . .
## AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## C1orf86 . .
## AL590822.2 . .
## SKI . .
## RER1 1.566387 .
## PEX10 . .
## PLCH2 . .
## PANK4 . .
## RP3-395M20.12 . .
## TNFRSF14 1.566387 .
## RP3-395M20.9 . .
## FAM213B . .
Generally with scRNA-seq data, we are interested in characterising heterogeneity between cells. In order to do this, it is helpful to select a subset of genes that contain the major components of biological variation, and ignore genes that contain only random noise. This also improves the computational efficiency of downstream analyses.
One approach to this is to select the most variable genes across the population. The assumption here is that biological variation for a subset of genes will lead to increased variation relative to genes driven only by technical noise and “uninteresting” biological variation (e.g., transcriptional bursting). There are a number of methods for quantifying per-gene variation and selecting highly variable genes (HVGs) on that basis.
FindVariableFeatures calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression. Here we keep the top 2,000 variable genes.
# Obtain variable genes
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)## When using repel, set xnudge and ynudge to 0 for optimal results
plot2## Warning: Transformation introduced infinite values in continuous x-axis
After feature selection, we will also scale the normalised expression data, that is we will mean center the data, i.e. subtract the mean. We will also divide by the standard deviation to make everything to a ‘standard normal’, where the mean is zero and the standard deviation is 1. We perform this step since dimensionality reduction techniques, like PCA, require scaled data.
pbmc <- ScaleData(pbmc)## Centering and scaling data matrix
pbmc@assays$RNA@scale.data[1:5, 1:5]## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## ISG15 -0.83530282 -0.83530282 0.39223510 2.21976210
## CPSF3L -0.27411451 -0.27411451 -0.27411451 -0.27411451
## MRPL20 1.50566156 -0.56259931 1.24504892 -0.56259931
## ATAD3C -0.04970561 -0.04970561 -0.04970561 -0.04970561
## C1orf86 -0.46390647 -0.46390647 -0.46390647 -0.46390647
## AAACCGTGTATGCG-1
## ISG15 -0.83530282
## CPSF3L -0.27411451
## MRPL20 -0.56259931
## ATAD3C -0.04970561
## C1orf86 -0.46390647
dim(pbmc@assays$RNA@scale.data)## [1] 2000 2638
scRNAseq analyses generally involve comparing the expression profiles of many thousands of cells and features. Even after keeping the highly variable features, we are still left with a large number of genes to directly perform downstream analysis. Therefore we seek to reduce the number of features wherever possible. As we discussed, PCA creates new features that are combinations of the existing genes (e.g. genes belonging to the same regulatory programme, often we call these new features as eigengenes).
For those curious, there is a great intuitive explanation of PCA on CrossValidated, especially the following figure from it that clearly demonstrates what PCA does.
Basically, PCA tries to find the axis through the data that explains the most variation. The distance of points from this axis becomes the first principal component (PC). The next axis is the one, orthogonal to the previous, that explains maximal variation (and so on).
Below we run PCA and keep the top 50 principal components (i.e. for downstream analyses we will use at most 50 features), where we hope they have captured most of the variability in our data.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc),
npcs = 50)## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
## SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
## RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
pbmc@reductions$pca## A dimensional reduction object with key PC_
## Number of dimensions: 50
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
We can create a scree plot to visualise the proportion of variability explained by each PC (Seurat calls it ElbowPlot).
ElbowPlot(object = pbmc, ndims = 30, reduction = "pca")For these data, we observe that most of the cell-to-cell variability is explained from the first 10 PCs.
We can plot the first 2 PCs and then colour each cell (each dot corresponds to a single cell) according to number of features and the mitochondrial percentage. There seems to be a slight correlation of PC2 with number of detected features (within each cluster), but in general it seems our normalisation strategy has removed the effect of sequencing depth (technical variability).
We also observe that according to the first 2 PCs there seem to be three major cell sub-populations in our data.
FeaturePlot(pbmc, dims = c(1, 2), features = c("nFeature_RNA"))We can use heatmaps to explore the primary sources of heterogeneity in our dataset. Both cells and features are ordered according to their PCA scores. We set cells = 200 to plot the extreme cells on both ends of the spectrum, which will help us identify correlated feature sets.
DimHeatmap(pbmc, dims = 1:4, cells = 300, balanced = TRUE, ncol = 2)Clustering is one of the most commonly used procedures with scRNA-seq data. It is often employed to discover or identify sub-populations of cells that correspond to interesting biological functions or classes. This is done by grouping together cells with similar expression profiles. It’s important to note that this is a process of assigning discrete labels to what is truly complex, continuous, high-dimensional data. Consequently, there is no “true” clustering in most cases - we simply seek to create a simplified representation of the data that allows us to easily understand and interrogate its structure.
Graph-based clustering, notably used by Seurat, is a flexible and fast technique for clustering high-dimensional data. First, we construct a graph where each node is a cell, connected to its K nearest neighbours in the original high-dimensional space. Here, K is a very important parameter, as it controls the resolution of the graph. A small value for K yields a more sparse graph and a more fine-grained clustering, while higher K results in an interconnected graph and broader clustering. It’s usually wise to experiment with the value of K and settle on a broadly useful resolution.
Edges between nodes are weighted based on the similarity of the cells involved. We then apply community-detection techniques, identifying groups of cells that are more connected to each other than to cells of other communities. These methods are highly scalable, but are also highly dependent on the density of the data. High density regions (highly similar cells) are likely to be “split” into several clusters, due to the number of steps needed to traverse these dense regions of the graph.
# Identify neighbours in the graph. We use the top 10 PCs as per the elbow plot
pbmc <- FindNeighbors(pbmc, dims = 1:10)## Computing nearest neighbor graph
## Computing SNN
# You could play with the resolution and check how the number of clusters changes.
pbmc <- FindClusters(pbmc, resolution = 0.5)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95965
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8723
## Number of communities: 9
## Elapsed time: 0 seconds
# The clustering commands we used above added another column,
# named `seurat_clusters`, which keeps the assignment of each
# cell to a cluster.
DimPlot(pbmc, dims = c(1, 2), group.by = "seurat_clusters", reduction = "pca")This seems quite good, however we would want to visualise the clustering of cells based on all principal components not the first two. To do so, we will perform non-linear dimensionality reduction of the 10 PCs, and visualise the results in 2 dimensions.
So remember the process:
t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction technique that attempts to create a low-dimensional representation that captures the neighbourhood relationships between the data. It is not restricted to linear combinations of features, as PCA is. This allows t-SNE to capture more complex structure within the data, and it can provide a useful way of visualising these structures in a way that is easily understood.
However, t-SNE has a number of disadvantages. Firstly, it does not necessarily preserve global distances in, meaning that the separation between two clusters does not necessarily have meaning. Another disadvantage is the strong dependence of the on the initialisation and a perplexity parameter. The latter controls the balance between local and global structure. Low values mean that the embedding only communicates information about each point’s near neighbours, while high values mean that the embedding seeks to capture global structure. It is important to note that t-SNE components are generally not considered to be suitable for use in cluster analysis or other quantitative methods.
There is a great interactive document here from Google employees demonstrating how the properties of the data and the hyperparameters of t-SNE can impact the resulting visualisations.
# Run t-SNE
pbmc <- RunTSNE(pbmc, dims = 1:10, reduction = "pca")# Show cluster labels in tSNE space
DimPlot(pbmc, dims = c(1, 2), group.by = "seurat_clusters", reduction = "tsne")Uniform manifold approximation and project (UMAP) is another non-linear dimensionality reduction technique. It is considered by some to be better at capturing the global structure of the data, but in many ways is similar to t-SNE.
pbmc <- RunUMAP(pbmc, dims = 1:10, reduction = "pca")## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 10:45:07 UMAP embedding parameters a = 0.9922 b = 1.112
## 10:45:07 Read 2638 rows and found 10 numeric columns
## 10:45:07 Using Annoy for neighbor search, n_neighbors = 30
## 10:45:07 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 10:45:08 Writing NN index file to temp file /var/folders/w7/mh3jgr7915d7l91bmvsn6qq00000gp/T//RtmpOJSVBz/file16e36c12dd6e
## 10:45:08 Searching Annoy index using 1 thread, search_k = 3000
## 10:45:09 Annoy recall = 100%
## 10:45:09 Commencing smooth kNN distance calibration using 1 thread
## 10:45:09 Initializing from normalized Laplacian + noise
## 10:45:09 Commencing optimization for 500 epochs, with 105124 positive edges
## 10:45:13 Optimization finished
pbmc@reductions$umap## A dimensional reduction object with key UMAP_
## Number of dimensions: 2
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
# Show cluster labels in UMAP space
DimPlot(pbmc, dims = c(1, 2), group.by = "seurat_clusters", reduction = "umap")Once we have identified discrete labels that we are happy with, we would like to identify genes that differ between our clusters. By default, Seurat identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
# The %>% command is called 'pipe' and is a way to write code more compactly
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat.geom
## # A tibble: 18 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
## 2 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
## 3 0 5.57 0.996 0.215 0 1 S100A9
## 4 0 5.48 0.975 0.121 0 1 S100A8
## 5 7.99e- 87 1.28 0.981 0.644 1.10e- 82 2 LTB
## 6 2.61e- 59 1.24 0.424 0.111 3.58e- 55 2 AQP3
## 7 0 4.31 0.936 0.041 0 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 1.17e-178 2.97 0.957 0.241 1.60e-174 4 CCL5
## 10 4.93e-169 3.01 0.595 0.056 6.76e-165 4 GZMK
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 1.05e-265 4.89 0.986 0.071 1.44e-261 6 GZMB
## 14 6.82e-175 4.92 0.958 0.135 9.36e-171 6 GNLY
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4
## 18 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP
A more visually appealling approach would be to use heatmaps, where we will group cells (columns) together and show the expression levels of the top marker genes per cluster.
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()## Warning in DoHeatmap(pbmc, features = top10$gene): The following features were
## omitted as they were not found in the scale.data slot for the RNA assay: CD8A,
## VPREB3, PIK3IP1, PRKCQ-AS1, NOSIP, LEF1, CD3E, CD3D, CCR7, LDHB, RPS3A
VlnPlot() (shows expression probability distributions across clusters), and FeaturePlot() (visualizes feature expression on a tSNE or PCA plot) are the most commonly used visualizations for feature expression. Let’s use some of the marker genes identified from the above analysis.
VlnPlot(pbmc, features = c("MS4A1", "CCL5", "TCL1A", "GZMK", "CCR7", "CD79A"))FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A",
"FCGR3A", "LYZ", "PPBP", "CD8A"))Finally, from the marker genes we can annotate each cluster to a known cell type.
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B",
"CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()